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Abstract 

Optimising queries in real-world situations under imperfect conditions is still 
a problem that has not been fully solved. We consider finding the optimal order in 
which to execute a given set of selection operators under partial ignorance of their 
selectivities. The selectivities are modelled as intervals rather than exact values 
and we apply a concept from decision theory, the minimisation of the maximum 
regret, as a measure of optimality. We show that the associated decision problem 
is NP-hard, which renders a brute-force approach to solving it impractical. Nev¬ 
ertheless, by investigating properties of the problem and identifying special cases 
which can be solved in polynomial time, we gain insight that we use to develop 
a novel heuristic for solving the general problem. We also evaluate minmax re¬ 
gret query optimisation experimentally, showing that it outperforms a currently 
employed strategy of optimisers that uses mean values for uncertain parameters. 

1 Introduction 

Although query optimisation in database management systems (DBMSs) has been a 
topic of research for decades, there are still important unresolved issues. In his recent 
blog post [21], Guy Lohman highlights errors made in estimating cardinalities as a 
crucial factor. These kinds of errors cause optimisers to generate query execution plans 
that are way off the target in terms of efficiency. Consequently, an optimiser should try 
to avoid potentially bad plans rather than strive for an optimal plan based on unreliable 
information. 

For typical workloads, a DBMS can compile statistical data over time to obtain a 
fairly accurate picture. For instance, estimating the selectivities of simple predicates on 
base relations in a relational database is fairly well understood and can be done quite 
accurately [12, 15]. However, the situation changes once systems are confronted with 
very unevenly distributed data values or predicates that are complex. 

Trying to estimate selectivities in dynamic settings, such as data streams [30], or 
in non-relational contexts, such as XML databases [27, 32], also poses challenges. It 
may even be impossible to obtain any statistical data, because the query is running on 
remote servers [31]. Detailed information may also not be available because a user 
issues an atypical ad-hoc query or utilises parameter markers in a query. We propose to 
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use techniques from decision theory for making decisions under ignorance', meaning 
that we know what the alternatives and their outcomes are, but we are unable to assign 
concrete probabilities to them [26]. 

In our approach we propose to build a robust query optimiser that is aware of the 
unreliability of database statistics and considers this during optimisation. When execut¬ 
ing a query, the DBMS encounters a particular instance of concrete parameter values: 
we call this a scenario. The problem is that, during the prior optimisation step, the 
optimiser does not know which scenario the DBMS will face during plan execution. 
Additionally, it is highly unlikely that there is a single execution plan that will yield the 
optimal cost for every potential scenario. Consequently, our goal is to choose a query 
execution plan that performs reasonably well regardless of the scenario it encounters. 
More specifically, we try to minimise the difference between the cost of a plan p and 
the cost of the optimal plan when p is executed under its worst-case scenario. This is 
called minmax regret optimisation (MRO), which is a well-known technique for mak¬ 
ing decisions under ignorance. Previous work on query optimisation has considered 
measures of robustness for query plans [3, 5, 23], but not in terms of MRO. 

In this paper, we focus on the selection operator a, an operator common to many 
data querying languages. Selection is sometimes called a filter operator in contexts 
such as data stream processing [2, 4] and sensor networks [10], where there is renewed 
interest in improving the efficiency of processing these operators. A very common 
setting is determining the order in which to apply a set of commutative filters to a 
stream or a set of data items, e.g. tuples of a relation, so as to keep the processing costs 
to a minimum. 

There are well-known techniques for ordering selection operators to filter out as 
many tuples as possible as early as possible at the lowest possible cost [14]. However, 
these techniques rely on having accurate values for the operators’ selectivities, i.e., the 
percentage of tuples passing a filter, and their processing costs (per tuple). Getting the 
estimation of selectivities (and/or costs) wrong can lead to high overall costs for the 
pipelined execution. 

Our technique is based on using intervals rather than exact values for describing 
selectivities, aiming at generating query plans that are minmax regret optimal. How¬ 
ever, identifying such plans, even for selection ordering, turns out to be NP-hard. As a 
result, we leave the investigation of further operators for future work and focus first on 
finding a good heuristic for MRO selection ordering. 

Intervals can provide a useful way to model selectivities when exact values are 
unknown or hard to compute. For example, Babu et al. [5] compute intervals from 
single-point estimates in order to model levels of uncertainty regarding the accuracy of 
estimates, based on how such estimates were derived. Moerkotte et al. [24] consider 
histograms which guarantee a maximum multiplicative error (called the q-error) for 
cardinality estimates. Given such an estimate, the true cardinality (selectivity) can 
easily be modelled by an interval, as we show in Section 2. 

For another situation in which interval selectivities arise, consider estimating the 
selectivities of string predicates which perform substring matching using SQL like, 
a problem known to be difficult [6]. As an example, let us consider a database in which 
email messages are stored in a relation emails, with attributes such as sender, 
subject and body (the textual contents of the email). Assume that many queries use 
selection predicates such as subject like '%invest%', so the database main- 

* Sometimes these are also called decisions under uncertainty. We refer to them as decisions under igno¬ 
rance to distinguish them from probability-based methods. 
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tains indexes on words and on 2-grams (say) of words which allow it also to provide 
selectivities for these. 

Although the database maintains an index on words, the selectivity for the word ‘in¬ 
vest’ will be an underestimate for the selectivity of sub ject like '%invest%' 
since the strings ‘reinvest’ and ‘investigation’ (and many others) also match this pred¬ 
icate. Even if we are able to enumerate all words containing the string ‘invest’, we do 
not know how to combine their individual selectivities into a single selectivity. Instead 
we can use an interval selectivity with the exact match as a lower estimate. As the 
upper estimate, we can use the minimum selectivity of all the 2-grams of ‘invest’ since 
any string containing ‘invest’ must contain all of its 2-grams as well. 

Example 1. Ai a concrete example, consider the following query on the Enron email 
data^: 

select sender 
from emails 

where body like '%action%' and 
body like '%like%' and 
subject like '%use%'; 

Let us denote the three predicates by A, L and U (for ‘action’, ‘likes’ and ‘use’). The 
interval selectivities for the three predicates, as computed using the method proposed 
above and explained in more detail in Section 8, are [0.03, 0.68]/or A, [0.17, 0.27] for 
L and [0.0008, 0.06] for U. Even if we consider only the upper and lower bounds of 
these intervals, they give rise to 8 possible scenarios. No single plan (order) is optimal 
for all 8 scenarios, so the best we can do is find the plan which minimises the maximum 
regret. This plan corresponds to the order UAL. The maximum regret for this plan 
arises in the scenario when U has its maximum selectivity, while A and L have their 
minimum selectivities (in this case, the predicates U and A should be swapped to get 
the optimal order). 

In the case of the above query, our heuristic finds the minmax regret optimal solu¬ 
tion. By way of contrast, an alternative heuristic such as that which takes the midpoints 
of the intervals and produces an optimal ordering based on those, produces the plan 
ULA. This plan has a maximum regret which is 44% worse than the minmax regret 
optimal plan. 0 

We should mention that the technique of using intervals can be applied to other 
approximate or error-tolerant queries as well. All we need is the selectivity for an exact 
query as the lower bound and the selectivity for a query that determines a candidate set 
with false positives as the upper bound. 

Our contributions in this paper are as follows: 

• We formalise the problem of optimal selection ordering under partial ignorance, 
i.e., when selectivities are given as intervals. 

• We identify a number of properties of the problem, including that (i) only ex¬ 
treme scenarios (i.e., in which each operator takes on its minimum or maximum 
selectivity) need to be considered, (ii) operators which dominate others (i.e., both 
their maximum and minimum selectivities are smaller) must appear before the 
dominated ones in any optimal plan, and (iii) the decision version of the problem 
is NP-hard. 

^http://WWW.cs.emu.edu/-./enron/ 
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• We investigate a number of special cases in which selection ordering under par¬ 
tial ignorance can be solved in polynomial time. Along the way, we also identify 
other important properties of scenarios in MRO selection ordering. 

• Based on our findings we develop efficient optimisation heuristics, which we 
evaluate experimentally, using synthetic data, the Enron email data, and the Star 
Schema Benchmark (SSB) [28]. The experiments demonstrate the benefit of 
using minmax regret optimisation, in some cases halving the deviation from the 
optimal plan compared to conventional techniques. 

The remainder of this paper is organised as follows. We start by reviewing related 
work on selection ordering and optimisation techniques in the next section. In Sec¬ 
tion 3, we formalise the problem of selection ordering under partial ignorance, using 
minmax regret optimisation as the criterion for optimality. Various properties of the 
problem are identified in Section 4, while the proof of NP-hardness is given in Sec¬ 
tion 5. Section 6 presents some special cases of the problem which can be solved in 
polynomial time. Our heuristic algorithm is given in Section 7, with its experimental 
evaluation presented in Section 8. Finally, we conclude in Section 9. 

2 Background and Related Work 

We assume we are given a set S' = {ai, (T2, ..., cr„} of selection operators, or equiv¬ 
alently a conjunctive predicate pi A p2 A • • -pn- The selectivity Si of operator ai or 
predicate pi is the fraction of tuples that satisfy the operator or predicate. Associated 
with each operator si is also a cost Ci, which is the cost per tuple of evaluating the 
operator. 

Most database systems keep statistics allowing them to estimate the selectivity for 
single attributes fairly accurately. For the joint selectivity of multiple attributes, much 
early work and many systems make the attribute value independence (AVI) assump¬ 
tion. This assumes that the selectivity of a set of operators {ai^ ^ai^ ... ai^ } is equal to 
Si^x Si^x ■ ■ ■ X Si^. If instead a system stores (some) joint selectivities (it is infeasible 
for it to store all of them), we can use the AVI assumption to “fill in the gaps” or use 
the estimation approach advocated in [22]. 

2.1 Selection Ordering 

Assuming we have accurate values for the selectivity Si and cost Ci of selection operator 
ai, we can calculate the rank of ap. 

Ti = {Si - l)/c* (1) 

Given a set of selection operators, sorting and executing them in non-decreasing order 
of their ranks results in the minimal expected pipelined processing cost [19] under the 
AVI assumption. Clearly, the computation of the ranks and the sorting can be done in 
polynomial time. A similar argument applies if a query uses a conjunction of predicates 
on the same relation, and query evaluation uses a simple table scan. In such a case, the 
optimiser should test the predicates in the order which minimises the total number of 
tests. Basically, ordering selection operators optimally is a solved problem, but only 
when given exact values for the Si and q. 

Similar optimisation problems have been studied in the context of sequential test¬ 
ing. Here the goal is to find faulty components as quickly as possible by testing them 
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one by one. Each component has a probability of working correctly and a cost for 
testing it. One of the earliest proposed solutions [16] relies on ranking the components 
and then ordering them by their ranks, very similar to the selection ordering described 
above. 

2.2 Optimising under Uncertainty 

In the following, we review different approaches for dealing with uncertain parameters 
during query optimisation. A common approach of many optimisers is to use the mean 
or modal value of the parameters and then find the plan with least cost under the as¬ 
sumption that this value remains constant during query execution, an approach called 
Least Specific Cost (ESC) in [7]. As Chu et al. point out in [7], if the parameters vary 
significantly, this does not guarantee finding the plan of least expected cost. 

An alternative is to use probabilistic information about the parameters fed into the 
database optimiser, an approach known as Least Expected Cost (EEC) [7]. (A discus¬ 
sion regarding the circumstances under which EEC or ESC is best appears in [8].) In 
decision-theoretic terms, we are making decisions under risk, maximising the expected 
utility. However, probability distributions for the possible parameter values are needed 
to make this approach work, whereas in our case we do not have these prerequisites. 

In parametric query optimisation several plans can be precompiled and then, de¬ 
pending on the query parameters, be selected for execution [11]. However, if there is 
a large number of optimal plans, each covering a small region of the parameter space, 
this becomes problematic. Eirst of all, we have to store all these plans. In addition, con¬ 
stantly switching from one plan to another in a dynamic environment (such as stream 
processing) just because we have small changes in the parameters introduces a con¬ 
siderable overhead. In order to amend this, researchers have proposed reducing the 
number of plans at the cost of slightly decreasing the quality of the query execution 
[9]. Our approach can be seen as an extreme form of parametric query optimisation by 
finding a single plan that covers the whole parameter space. 

Another approach to deal with the lack of reliable statistics is adaptive query pro¬ 
cessing, in which an execution plan is re-optimised while it is running [2, 5, 17, 23]. 
It is far from trivial to determine at which point to re-optimise and adaptive query pro¬ 
cessing may also involve materialising large intermediate results. More importantly, 
this means modifying the whole query engine; in our approach no modifications of the 
actual query processing are needed. A gentler approach is the incremental execution 
of a query plan [25]. Deciding on how to decompose a plan into fragments and putting 
them together is still a complex task, though. 

Estimates based on intervals arise explicitly in [5] and implicitly in [24]. As men¬ 
tioned in the Introduction, Babu et al. [5] use intervals to model uncertainty in the ac¬ 
curacy of a single-point estimate. Uncertainty is represented by a value from 0 (none) 
to 6 (very high). Upper and lower bounds for the single-point estimate are then cal¬ 
culated using the estimate and the uncertainty value. During optimisation, only three 
scenarios, those using the low estimates, the exact estimates and the high estimates, 
are considered, rather than all scenarios as in our approach. Moerkotte et al. [24] study 
histograms which provide so-called q-error guarantees. Given an estimate s for s, the 
q-error of s is max(s/s, s/s). An estimate is g-acceptable if its q-error is at most q. So 
if an estimate s is g-acceptable, the true value s lies in the interval 1/qx s < s < qx s. 

Notions of robustness in query optimisation have been considered in [3, 5, 23]. 
Babcock and Chaudhuri [3] use probability distributions derived from sampling as well 
as user preferences in order to tune the predictability (or robustness) of query plans 
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versus their performance. For Markl et al. [23], robustness means not continuing to 
execute to completion a query plan which is found to be suboptimal during evaluation; 
instead re-optimisation is performed. On the other hand, Babu et al. [5] consider a plan 
to be robust only if its cost is within e.g. 20% of the cost of the optimal plan. None 
of these papers consider robustness in the sense of MRO. Moreover, these techniques 
need additional statistical information to work. 


2.3 Optimising under Ignorance 

Minmax regret optimisation (MRO) has been applied to a number of optimisation prob¬ 
lems where some of the parameters are (partially) unknown [1]. The complexity of the 
MRO version of a problem is often higher than that of the original problem. Many 
optimisation problems with polynomial-time solutions turn out to be NP-hard in their 
MRO versions [1]. 

One example is minimising the total flow time (TFT), in which n jobs are scheduled 
on a single machine [18]. The^ow time of a job is the sum of its processing time and 
the time it has had to wait before starting execution. The total flow time is the sum of 
the flow times of all n jobs. This scheduling problem can be solved in polynomial time 
given exact job lengths (by sorting the jobs in non-decreasing order of their processing 
times [20]), but becomes NP-hard in its MRO variant [20]. Researchers have developed 
approximation algorithms for the problem; for example, a 2-approximation algorithm, 
bounding the approximate solution to be no more than twice the optimal solution, is 
proposed in [18]. 

Among all MRO problems, TFT is the one closest to the problem we are inves¬ 
tigating. However, there are substantial differences: the formula for computing the 
cost of a schedule is much simpler for TFT, and the approach chosen to obtain a 2- 
approximation does not guarantee a bound for MRO selection ordering, as we show in 
Section 4. 


3 Selection Ordering MRO 

In this section we give a formal definition of the generalised selection ordering problem 
with partially defined selectivities. The exact costs of selection operators can also be 
unknown, but for the moment we restrict ourselves to partially defined selectivities. 

3.1 Basic Definitions 

We start out with definitions for selection operators with interval selectivities and basic 
properties. 

Definition 1. Given a set S = {cri,CT 2 ,... ,cr„} of selection operators, each has a 
selectivity Si and a cost Ci. Each selectivity is defined by a closed interval: for 1 < i < 
n. Si = [Sj, Si] with s^,Si G [0,1] and s^ < Si. For 1 < i < n, Ci G represents the 
cost of Gi for processing an input tuple. 

Depending on their selectivity intervals selection operators may relate to each other 
in a special way. Later on we exploit this property in order to optimise selection orders. 

Definition 2. Given two selection operators Gi, Gj G S, we say that Gi dominates Gj 
if < Sj and Si < Sj. The set S of operators is called dominant if for each pair 
Gi, Gj G S it is the case that either Gi dominates Gj or Gj dominates Gi. 
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Later on, it will be helpful to consider a special case of dominant sets of operators. 


Definition 3. Given two selection operators Ci, aj G S, we say that ui strictly dom¬ 
inates Gj if Si < Sj. A strictly dominant set is defined analogously to a dominant 
set. 

If for two selection operators Gi, Gj G S, neither Gi dominates gj nor gj dominates 
Gi, then Gi and Gj form a nested pair of operators. So, operator Gi is nested in Gj if 

Sj < Si and Si < Sj. 

Example 2. Let S = {gi, G 2 , g^} be a set of selection operators, with selectivities 
Si = [.2, . 8 ], S 2 = [.3, .5] and S 3 = [.1, .4]. Operator 173 dominates both gi and g^, 
but does not strictly dominate either of them. Because G 2 is nested in gi, the set S is 
not dominant. 0 

Definition 4. An assignment of a concrete value to each of the n selectivities is called 
a scenario and is defined by a vector x = (si, S 2 ,..., s„), with Si G [Si, Si]. 

Every time we actually run a query, we encounter one scenario. However, during 
the optimisation step we are unaware of which scenario we will face. The set of all 
possible scenarios can be described by X = {a; | a; G [si, Si] x [53, S 2 ] x... x [s„, s„]}. 
There are certain scenarios we are particularly interested in: 

Definition 5. A scenario Xext = (si, S 2 ,..., s„) is called an extreme scenario if for 
each I < i < n. Si is equal to either Si or Si. 

Let tt” be the set of all possible permutations over 1,2,..., n. For Wj G tt", TTj{i) 
denotes the i-th element of iTj. 

Definition 6. A query execution plan pj is a permutation GTr( 2 ), ■ ■ ■, CTtt («) of 

the n selection operators. The set of all possible query execution plans is given by 

P = {p\p= CT^(l),Cr7r(2),- - ■ ,cr7r(n) SUch that TT G Tt”}. 

The cost of evaluating plan pj under a given scenario x is 


CoSt(pj,x) -f S.x{\)^'x{2) “b St^^^IJ S7i.( 2) C7r(3) 


n—1 

“b * ' ' “t“ ^ 7T (i) ^TT (n)^ 

i=l 



( 2 ) 


LI is the cardinality of the relation on which we execute the selection operators. Cur¬ 
rently we make the AVI assumption that the selection predicates are stochastically in¬ 
dependent. Extending our approach to situations in which (some) joint selectivities are 
known is a topic for future work. 

Example 3. Recall the set S = {gi,G 2 ,G 3 } of selection operators from Example 2, 
with selectivities si = [.2, . 8 ], S 2 = [-3, .5] and S 3 = [.1,.4]. There are 8 extreme 
scenarios for this example, one being given by scenario Xi = (si, S 2 , S 3 ) = (.2, .3, .1). 
One the the 6 possible plans for S is given by plan pi = G1G2G3. Assuming that LI 
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and each cost Ci is set to 1 , we can calculate the cost of plan pi under scenario xi, 
Cost{pi, Xi), using Equation (2) as follows: 

Cost{pi,Xi) = (1 + .2 + .2 X .3) = 1.26 


0 

Let Popt{x) stand for the query execution plan having the minimal cost for scenario 
X, and let Tropt(x) be the permutation of the selection operators for this plan. Since we 
are facing multiple scenarios, the criterion for evaluating the optimality of a plan pj is 
different to the one used in the classical selection ordering problem. We utilise minmax 
regret optimisation to determine the quality of a plan. 

3.2 Minmax Regret Optimisation 

Below we define the regret for a plan given a scenario, the maximal regret for a plan, 
and finally the problem of finding a plan that minimises the maximal regret. 

Definition 7. Given a plan p and a scenario x, the absolute regret 7 (p, x) of p for x 
is: 


l{p, x) = Cost{p, x) - Cost{popt(x), x) (3) 

where Popt{x) is optimal plan for scenario x. The maximal regret of a plan is the 
regret for its worst-case scenario and is simply defined as maxa;ex( 7 (p, a;)). 

Definition 8 . Given the set P of all possible execution plans and the set X of all 
possible scenarios, minimising the maximal regret is done as follows (where R{P,X) 
is the optimal regret): 

R(P,X)= inmp^p(m.ax^^x{l{p,x))) 

Given a set S of selection operators, let P(S) denote the set of possible plans for S and 
X{S) denote the set of possible scenarios for S. Then the minmax regret optimisation 
problem for S, which we denote MRO(S), is to find a plan whose maximum regret 
matches R(P(S), X(S)). For simplicity and when there is no confusion, we also use 
MRO(S) to denote a plan which minimises R(P(S), X(S)). 

Example 4. Recall once again the set S = {o'l, a 2 , cr^} of selection operators from 
Examples 2 and 3, with selectivities si = [.2, . 8 ], S 2 = [.3, .5] and S 3 = [.1,.4]. 
For simplicity, assume that all operators have the same cost 1 and that the relation 
has cardinality = I (so to get the real costs, the numbers in Table 1 just have to 
be multiplied by the true cardinality). To find the plan which minimises the maximum 
regret, we can perform an exhaustive enumeration of all possible execution plans under 
every possible scenario. We show later in Theorem 1 that it is sufficient to consider only 
the extreme scenarios since the worst case scenario for any plan is always an extreme 
one. Hence, if there are n operators, we need to consider n\ different execution plans 
under each of 2" extreme scenarios. For our example. Table 1 shows the 48 regret 
values for the 6 possible plans under each of 8 extreme scenarios. 

For example, recall from Example 3 that the cost of the first plan pi = (Ti(T 20’3 
under scenario XI = (sj^, 52 , 33 ) = (.2, .3, .1) is 1.26. The optimal plan for 

any scenario x is one in which the operators are in non-decreasing order of their 




51 

5 2 

5 3 

51 

5 2 

5 3 

51 

5 2 

5 3 

51 

52 

5 3 

51 

5 2 

5 3 

51 

5 2 

5 3 

51 

5 2 

5 3 

51 

5 2 

5 3 

Max 

Regret 

0'iCr2Cr3 

0.14 

0 

0.18 

0.02 

0.91 

0.62 

1.05 

0.6 

1.05 

(TlCr3(T2 

0.1 

0.02 

0.1 

0 

0.75 

0.7 

0.73 

0.52 

0.75 

(T20’icr3 

0.24 

0.1 

0.48 

0.32 

0.41 

0.12 

0.75 

0.3 

0.75 

(T2CT3Cri 

0.21 

0.16 

0.43 

0.42 

0.2 

0 

0.4 

0.1 

0.43 

'73(71(72 

0 

0.22 

0 

0.2 

0.05 

0.3 

0.03 

0.12 

0.3 

(73(72(71 

0.01 

0.26 

0.03 

0.32 

0 

0.1 

0 

0 

0.32 


Table 1: The regret for each plan under each scenario in Example 4. 


selectivities. Therefore, the optimal plan for scenario Xi is Popt(xi) = 0 ’ 3 (TiCT 2 and its 
cost is: 

Cost{popt{xi),Xi) = (1 + .1 + .1 X .2) = 1.12 
The regret of plan pi under scenario Xi using Equation (3) is: 

'y{Pi,Xi) = Cost{pi,xi) - Cost{p„pt,^^^pXi) 

= 1.26-1.12 = 0.14 

In order to find the minmax regret solution, the maximum regret of each plan needs to be 
found. For plan pi, the maximum regret is 1.05 which occurs in scenario {si,S 2 ,s^), 
its worst-case scenario. The maximum regret for each plan is shown in bold face in 
Table 1. 

Finally, we are looking for the plan with the smallest maximum regret (i.e., the 
smallest value in the last column of Table 1). As a result the minmax regret solution, 
MRO{S), is plan cr 3 (Ticr 2 , which has the best performance among all plans when 
confronted with their worst-case scenarios. 0 

In the above example, it is interesting to consider which scenario gives rise to the 
maximum regret for each plan. Note that for each plan its worst-case scenario is one in 
which the operators in some initial sequence in the plan each take on their maximum 
selectivity followed by the remaining operators taking on their minimum selectivity. 
We call such a scenario a max-min scenario. 

Definition 9. Fet p be the plan cr^( 2)5 ■ ■ ■, cr^(n)- ^ scenario for p is called a 

max-min scenario if there is a 0 < k < n such that for all 1 < i < k, s^(i) = 
and for all k 1 < i < n, s,r(i) = S 7 r(i)- 

So the first k operators in p take on their maximum selectivity, while the rest take on 
the minimum. Note that for a plan p with n operators, there are n -f 1 max-min scenar¬ 
ios. Max-min scenarios are the only scenarios considered by the max-min heuristic we 
develop in this paper. However, it is important to state that, in general, the worst-case 
scenario for a plan may not be a max-min scenario. 

4 Properties of MRO 

Before presenting algorithms for solving the MRO selection ordering problem, we 
identify some of its important properties. 
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4.1 Extreme Scenarios 


In order to determine the worst-case scenario of a plan, i.e., the scenario for which a 
plan exhibits its largest regret, we only have to check extreme scenarios. 

Theorem 1. The worst-case scenario for any query plan p is always an extreme sce¬ 
nario. 

Proof. We introduce the following notation to show that our cost formulas are piece- 
wise linear functions: 

LUy) ■■= E 

i=l,i^x 
n—1 

Kiy) ■■= E 

i=y.^i^x 



where LJ (j/) computes the cost of plan p with the operator permutation tt up to the 
operator at position y. We skip the operator at position x, i.e., the summand in which 
STr{x) appears first is left out of the sum and s^(a;) is omitted in all products. Analo¬ 
gously, we define Rf. (y) which computes the cost to the end of the plan starting from 
position y. If we do not want to skip any operators, we simply write L^{y) or R^{y). 

Expressing the costs of p and Popt(x) a function of Sm- 

Cost(p,x, Sm) = {v - 1).SmRliv - 1) 

CoSt{popt(x) , X, Sm) = (ti; - 1) -f SmRZ^”*^"^ {w - 1) 

we see that Cost(p, x) is a linear function in Sm- Cost{popt{x),x, Sm) is linear as long 
s^opt(x) - 1) < s™ < s^opt(x) + !)■ If ■Sm leaves this range, then Popt(x) will 
change, as all operators are sorted in ascending order of their selectivity. Nevertheless, 
Cost{popt(x)iX, Sm) is a piecewise linear function. Clearly, we can swap the positions 
of two operators in an optimal plan without changing its optimality if the operators 
have exactly the same selectivity. So if Sm = 'S7r„pt(^) (w — 1) = ... = {w — k), 

then we can swap am with a^-k- Analogously, if Sm = ■S'n-opt(p) (w -f 1) = ... = 
s^ppj(^) {w -\- k), then we can swap am with (Ttp+fc- For the cost of the optimal plan, this 
means Cost(^popt(x)^ Xj Sm) 


L^opt(p) (u; _ fc _ 1) + SmRw”*^'^'’ (w-k-l) 

if S^opt(p) (w-k-l) <Sm< S7r„p*(p) (w - k) 
L^opt(x) - 1) + (w - 1) 

if S’^opt(p) (w' - 1) < Sm < (w -f 1) 

(W + fc) + SmRw”"^^^ (W + k) 
if Sii-pptfp) {w k) < Sm < s^pp,(^) (w -I- fc -I- 1) 


Figure 1 illustrates the cost functions forp ttnd Popt{x)- 

We show that Cost{popt(x)jX, Sm) is a concave (or convex upwards) function. For 
our piecewise linear function this means proving that by increasing Sm (moving into 
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Figure 1: Visualisation of Cost(p, x, Sm) and Cost{pgp^(^^'j,x, Sm) 


the next piece) the slope will never increase, while if we decrease s™, the slope will 
never decrease. 

Increasing will change the slope of Cost(p^pf(^p x, Sm) from (w- 1) 

to {w + k). {w + k) is less or equal than {w — 1), as they are 

identical except for the additional summands ru—Itow + A:—lin {w — 1). 

Analogously, decreasing Sm will change the slope from to {w— 

k — 1) (which is greater or equal). 

So 7 (p, x) = (Cost(p, x) — Cost{popt{x),x)) is a convex function, whose domain 
is restricted to a polyhedral convex set, defined by the lower and upper bounds of the 
select!vities. The global maximum of such a function is always found at one of the 
extreme points of the polyhedral convex set (Corollary 32.3.4 in [29]). □ 

4.2 Domination 

We can determine the relative order two operators have to be in to minimise the maxi¬ 
mal regret if one operator dominates the other. 

Theorem 2. If a a dominates at, then there exists a plan p minimising the maximal 
regret in which a a precedes at. 

Proof Assume that p is a plan minimising the maximal regret in which ab precedes 
(Ja- 7r(w) = b and Tr{w + k) = a. Furthermore, assume that p' is constructed from p 
by swapping and = a and Tr'{w + k) = b. All the other operators are 

in exactly the same order as in p. We assume that p' does not minimise the maximal 
regret. 

Let us investigate the difference in regret between p' and p for any given scenario 
X. Since the optimal plan is the same for both regrets 

w+k—l / i 

CoSt(p , x') CoSt(p, x) ^ ^ I 

i=w \^j=l,j^w 

we only need to check what happens between positions w and w + k — 1, as {w — 1) 
and R^ {w + k) (see the proof of Theorem 1 for the meaning of this notation) are 
identical for both p and p'. 
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Figure 2: Visualisation for scenario y” 


Because this holds for every scenario, it also holds for the worst-case scenario y' 
of p'. Let us assume first that the selectivities for a a and ab in y' are either (sa, Sf,), 
{sai'Sb), or (^,^). Since CTq dominates ab, we know that Sa < Sb, Sa < and 
^ <Jb- Therefore, 


w-\-k— 1 

(Sa - Sb) 

i—w 



< 0 


which means that the maximal regret of p' cannot be greater than that of p. This also 
holds for strict domination, i.e., when ^ < Sb- However, this is a contradiction to our 
assumption. Thus, for the worst case scenario y', we must have the selectivities (^, s?,) 
with > Sb- 

Let us look at a scenario y" which is identical to y' except for Sa = Sb = Sc with 
> Sc > Sb (see Figure 2). From Theorem 1 we know that the regret can be increased 
by moving to an extreme scenario. In this case Sa has to be increased from Sc to and 
Sb has to be decreased from Sc to Sb to reach the maximal regret 7 (p', y'). 

Clearly, 7 (p, y”) = 'y{p', y”). The following is illustrated in Figure 2. Increasing 
Sb from Sc to s^ and decreasing Sa from Sc to Sb for p under scenario y” (dotted 
arrows) will have exactly the same effect as increasing Sa and decreasing Sb for p' 
under scenario y” (solid arrows). However, this may not be an extreme case scenario 
for p yet. Further increasing st to Jb and decreasing Sa to Sa can never decrease the 
regret (according to Theorem 1). But that means we have found a scenario for p which 
has at least the same regret as the worst-case scenario for p', which contradicts our 
assumption. □ 

Example 5 . Recall from Example 4 the set S = {ci, (T2, <73} of selection operators, 
with selectivities si = [.2, .8], S 2 = [.3, .5] and S3 = [.1, .4]. Because a^ dominates ai 
and CT 2 , in the minmax regret solution, i.e. plan a 3 aia 2 , a^ precedes ai and CT 2 . As a 
result of domination, in this example we would only have to consider two plans when 
searching for the minmax regret solution. 0 
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4.3 Midpoints of Intervals 

For the TFT problem, Kasperski used the simple heuristic of sorting jobs in non¬ 
decreasing order according to the midpoints of their intervals, yielding a 2-approx- 
imation [18]. This approach does not guarantee a bound for MRO selection ordering; 
as shown below, the quality of the solution can become arbitrarily bad. 

Given 2n + 1 operators, the first n operators have the selectivities = 0 and 
s7 = 1 (1 < f < n), while the next n operators have the selectivities Si =~sl = 0.5 + e 
(n + 1 < i < 2n) for some small e. The final operator has a constant selectivity of 1 
to guarantee that it will always be in last position, meaning that its selectivity will not 
impact any further steps. 

The midpoint heuristic will order the operators in exactly this way, from 1 to 2n+1. 
Clearly, the worst-case scenario for this plan is when Si is set to 1 for 1 < Si < n. In 
the optimal plan for this scenario, the operators with n+1 < i < 2n will be executed 
first. 

The regret of this plan is computed as follows: 

1 + ... + 1” + /(n) 

- (0.5-fe) - (0.5-fe)2 ... - (0.5-f e)” - g{n) 

where /(n) and g{n) stand for the cost of the remaining operators in the plan. A lower 
bound for this expression is the following, since /(n) > g{n) (see Lemma 3 below): 

n — n(0.5 + e) 

With increasing n and small values for e, this expression can get arbitrarily large. 

Lemma 3. Given a query plan p and a scenario x, we have the following relationship 
between the summands in Cost{p, x) and Cost{popt{x)j x), where Popt(x) is the optimal 
plan for scenario x: 

k k 

n ^ n all k with 1 < fc < n - 1 

Proof If there exists an Sm = 0, then 11^=1 ~ ® ^ above 

holds, as Si > 0 for all i. This is due to = Sm = 0 (Popt(x) sorts the selec¬ 

tions in non-decreasing order of their selectivities according to the ranking algorithm). 
Thus, in the following all Si > 0. 

We assume there is a fc for which 11^=1 ^ 11^=1 (proof by contra¬ 

diction). Let TT^ = {ttii) I 1 < t < fc} be the set of indexes of the first k selection 
operators in p and = {T^opt{x){i) | 1 < * < fc} the set of indexes of the first k 

selection operators in Popt{x)- If then the two products are equal, which 

is a contradiction to our assumption. So in the following we assume tt* f tr^pt(x)- 
Nevertheless, the intersection between and may be non-empty. In this 

case we can discard all the selectivities common to both products: 


n 

n 

< n 

sj n 






n 

< n ^ 

^3 
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Therefore, there exists i G tt^ \ such that Si < max(s; | I G \ (or 

the inequality would not hold). However, that means ai has appeared in p but not yet 
in Popt{x)- This is a contradiction; the selection operators are sorted in non-decreasing 
order in Popt(x) ai should have appeared in Popt{x) before the selection operator 
with selectivity max(s/ | I G \ □ 

5 Hardness of MRO 

In this section, we show that the decision problem for general MRO{S), which we call 
MINMAX REGRET, is NP-hard. In this version, we are given a set S' = {ai,a2, ■ ■ ■ ,an} 
of selection operators, with each operator ai assumed to have unit cost. We are also 
given a set X = {Xi,X 2 , ■ ■ ■, X^} of scenarios, where each scenario Xj specifies a 
selectivity Sij for each operator ai, 1 < j < m and 1 < i < n. 

To simplify the notation, let us identify a plan p with the permutation tt it defines, 
and from now on use 7r(i) to denote the index of the operator appearing in position i in 
plan TT. 

Below we define the decision problem MINMAX REGRET as well as the well-known 
NP-complete problems SET COVER and EXACT COVER BY 3-SETS. 

MINMAX REGRET: given a set S of n selection operators, a set X of m scenarios, and 
a real number R, is there a plan whose maximum regret is less than R1 

SET COVER; given a finite set A, a collection T of subsets of A, and a positive integer 
r, is there a subset C = {Ci, ..., Cr} of T such that Uc eC 
C covers AI 

EXACT COVER BY 3-SETS: given a finite set A with |A| = 3g and a collection T of 
3-element subsets of A, is there a subset C of T such that each element of A occurs in 
exactly one member of C? 

It is known that a restriction of EXACT COVER BY 3-SETS which requires that 
each element of the set A appears in exactly three subsets of T is NP-complete [13]. 
Since SET COVER is a generalisation of EXACT COVER BY 3-SETS, we also have that 
RESTRICTED SET COVER, defined below, is NP-complete. 

RESTRICTED SET COVER; given finite set A, collection T of subsets of A such that 
each element of A appears in exactly three subsets of T, and positive integer r, is there 
a subset C = {Ci, ..., Cr} of T such that C covers A7 

We show that MINMAX REGRET is NP-hard by reducing RESTRICTED SET COVER 
to it. 

Theorem 4. minmax regret is NP-hard. 

Proof. We reduce RESTRICTED SET COVER to MINMAX regret. Given an instance 
of RESTRICTED SET COVER represented by A, T and r, we construct an instance of 
MINMAX REGRET as follows. Let |A| = m and \T\ = n. Each subset Cj in T is 
represented by an operator aj in S, and each element Oi G A is represented by a 
scenario Xi G X such that the selectivity for operator aj in Xi, that is, Sij is l/(n -f 1) 
if Ui G Cj and 1 if Oi ^ Cj. Since each element of A appears in exactly three subsets, 
each scenario Xi G X has three selectivities of l/(n -f 1) and n — 3 selectivities of 1. 
Hence the optimal plan for each scenario has the same cost, say, p. We set i? to r — p 
and claim that there is a subset of T of size r which covers A if and only if there is a 
plan whose maximum regret over all scenarios is less than R. 
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Assume there is a subset C = {Ck ^, ■ ■ •, C'fc^} of T which covers A. Let tt be any 
plan in which 7 r(i) = aki, 1 < * < r, that is, in which the first r operators correspond 
to subsets in the cover. Since C is a cover, for no scenario Xi can it be the case that 
the selectivity for each of the first r operators in tt is 1. At worst, the first r — 1 
operators have selectivity 1, with the r’th operator having selectivity l/(n + 1), and 
the remaining n — r operators having selectivity 1. The cost of this plan is therefore 
r — 1 + (n — r + l)/(n + 1), which is always less than r. Hence the regret is less than 
R = r — p, where p is the cost of the optimal plan. 

Now assume no subset of T of size r covers A. In other words, for every subset 
of size r, at least one element of A is not in any set in the subset. Hence, for every 
plan TT, there must be some scenario in which the first r operators have selectivity 1. 
Finding a plan which minimises the maximum regret is the same as finding a plan 
which minimises the maximum cost since the cost of the optimal plan is the same for 
each scenario. Since every plan in this case has cost at least r, there is no plan whose 
maximum cost is less than r. Hence there is no plan whose maximum regret is less 
than R = r — p, where p is the cost of the optimal plan. □ 


6 Some Polynomial-Time Cases 

In this section we show that, for sets of selection operators S satisfying certain prop¬ 
erties, MRO{S) can be found in polynomial time. In particular, we look at dominant 
operators, which can easily be ordered correctly, and their combination with constant 
operators, i.e., operators for which we can obtain exact selectivity values. As before, 
we assume that the cost of each operator is one. 

6.1 Constant and Dominant Operators 

Let S' be a set of selection operators such that the selectivity of each operator can be 
estimated accurately (i.e., each selectivity is constant). Then, as mentioned in Sec¬ 
tion 2.1, MRO{S) can be found by sorting the operators in non-decreasing order of 
their rank given by Equation (1). Given our assumption that each operator has cost 
one, finding MRO{S) reduces to sorting the operators in non-decreasing order of their 
selectivity alone. 

Recall from Section 3.1 the definition of a dominant set S of operators. Given a 
dominant set S of operators, it follows from Theorem 2 that the minmax regret solu¬ 
tion is one where the operators are sorted in non-decreasing order according to their 
minimum (or maximum) selectivity value. (Note that a set of constant operators is a 
special case of a dominant set of operators.) We therefore have: 

Corollary 1 . If S is a dominant set of n operators, then MRO{S) can be solved in 
O(nlogn) time. 


6.2 Strictly Dominant Operators with a Constant Operator 

When we include nested operators (recall the definition from Section 3.1), the problem 
becomes much more difficult. As a step in the direction of solving the general problem, 
we consider below the simple case of a strictly dominant set of operators (also defined 
in Section 3.1) along with a single constant operator nested within one of the non¬ 
constant operators. If S' is a strictly dominant set of operators, then the plan MRO{S) 
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has zero regret under all scenarios. This is because all operators in MRO{S) will be 
in the same position as in the corresponding optimal plan under all scenarios. 

Let S' be a strictly dominant set which includes a constant operator CTc nested within 
one of the non-constant operators, say In this case, we know how to place the 
dominant operators relative to each other in MRO{S) but we need to determine the 
position of CTc in MRO{S). Since S. < Sc < Si, the constant operator CTc should 
be either immediately before or immediately after ai in MRO{S). Interestingly, the 
correct position for CTc depends only on the midpoint of the selectivity Si of ai. 

Proposition 1. Let S be a strictly dominant set of n operators such that MRO{S) = 
(cTi,..., (J„). Let Gc be an operator with constant selectivity Sc such that s.i^ < Sc < Si, 
for some I < i < n, and S' = S U {sc}. In MRO{S'), Gc is placed between (1) Gi-i 
and Gi if Sc < {Si + Si)/2, or (2) Gi and Gi+i if Sc > (sj + Si)/2. 

Note that if Sc = {s^ + Si)/2, then Gc can be placed either between and Gi or 
between Gi and Gi+i in MRO{S'). 

Proof The operators Gi and Gc are always neighbours in an MRO solution (appearing 
after Gi-i and before Gi+i): any operator Gh such that 1 < h < i — 1 dominates Gc 
and Gi, and Gc and Gi dominate any operator gj such that t + 1 < j < n. 

Let us assume that Gi has selectivity Sj in scenario x and selectivity Si in scenario 
x'. Moreover, let us define plan p that is constructed from MRO{S) by placing Gc 
before Gi, Tr{v) = c and tt{v + 1) = z. Similarly we define plan p' by placing Gc after 
Gi (i.e. 7r'(v) = i and Tr'{v + 1) = c). 

Case 1: The optimal plan Popt{x) places Gi before Gc in scenario x (the operators 
are sorted in non-decreasing order of their selectivities), i.e., Gi and Gc are at position v 
and f -f 1, respectively. We now compute the maximum regret of p and p' for scenario 
x: 

Cost{pcpt(x),x) = 1)(1-fs.-fs.Sc)-f -f 2) 

Cost(p, a;) = — 1)(1-I-Sc-I-ScsJ-I--I-2) 

Cost(p',a;) = L’" (w — 1)(1-I-s.-I-s.Sc)-I-i?’" (w-I-2) 

As L'"'{v — 1) = S'" (v — 1) and R''{v + 2) = K" (v + 2), we can compute the regret 
of p and p' as follows: 

7(f. x) = L''(v - 1) (sc - Si) (4) 

l{p',x) = 0 (5) 

So for scenario x, plan p has a greater regret than p' and it can be calculated by Equa¬ 
tion (4). 

Case 2: In scenario x', Gi follows Gc in Popt{x')- For computing the costs of the 
different plans, this means: 

Cost{popt(x'),x') = L’"(z; - 1)(1-I-Sc-I-ScSi)-I--I-2) 

Cost(p, a;') = — 1)(1-I-Sc-I-ScSi)-I--I-2) 

Cost(p',a;') = L’" (u — 1)(1-I-Si-I-SiSc)-I-i?’" (w-I-2) 


Consequently, the regret of p and p' is 

j{p,x') = 0 

7 (p', x') = L''(v - 1) (si - Sc) 
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(6) 

(7) 



In this case (scenario x') plan p' has a greater regret than p and it is calculated using 
Equation (7). 

Comparing both cases we can see that p has a smaller maximum regret than p' 
whenever Eq. (4) < Eq. (7). This is the case when Sc < (sj + Si) /2 and then we place 
(Tc before ai. Similarly, plan p' has a smaller maximum regret than p' whenever Eq. (4) 
> Eq. (7). We place CTc after ai when Sc > (s^ + Si)/2. Eor the breakeven point, i.e. 
Sc = {si + Si)/2, ac can be placed before or after ai. 

□ 

Proposition 1 can be generalised to the case in which each non-constant operator 
has at most one constant operator nested within it. An interesting observation about 
the situation described in Proposition 1 is that the worst-case scenario is a max-min 
scenario. 

Proposition 2. Let S be a strictly dominant set of n operators such that MRO{S) = 
(cTi ,... ,an). Let CTc be an operator with constant selectivity Sc such that < Sc < Si, 
for some 1 < i < n, and S" = S' U {sc}. The scenario (si,..., Sc, Sj,..., s„), 
in which either Oj-i or Uj is equal to ai, is a worst-case scenario for MRO{S'). 

Proof Prom Proposition 1 we know that if Sc < (s^ -f sf) /2, then the minmax regret 
is computed by Equation 4 for plan p. In plan p, ai follows CTc (so aj = af) and the 
selectivity of ai is Sj. The other selectivities do not influence the regret, so we can set 
the selectivity of the operators ai to aj-i to the upper bound and the selectivity of the 
operators aj+i to to the lower bound. If Sc > (sj -f sf) /2, then the minmax regret 
is computed by Equation 7 for plan p' . In plan p' , ac follows ai (so aj-i = af) and the 
selectivity of ai is Si. Here we choose the upper bounds for the operators cti to aj -2 
and the lower bounds for the operators aj to ct„. In both cases this results in a max-min 
scenario. □ 


7 Max-min Heuristic 

Computing the regret of every selection ordering for every possible scenario makes the 
brute-force algorithm infeasible, since there are n! different orderings and 2" scenarios, 
given n operators. So in order to find an efficient heuristic, we have to significantly 
reduce the number of orderings and scenarios. While doing so, we want to leverage the 
insights gained from our theoretical investigation. 

Let us first look at the number of possible scenarios. As we have seen in the pre¬ 
vious section, max-min scenarios seem to play a special role when it comes to the 
maximum regret of a given plan p. Intuitively this makes sense, as in an optimal plan 
many of the operators ai located towards the beginning of p with selectivities ^ will 
trade places with operators aj located towards the end of p with selectivities Sj. Con¬ 
sequently, there tends to be a large difference between the plan p and an optimal plan 
for a max-min scenario, leading to a substantial (if not maximal) regret for p. So in 
our heuristic we aim to generate plans that perform well for max-min scenarios. This 
reduces the number of scenarios we have to consider from 2" to n -f 1. 

We now turn to determining the order of the selection operators. There are two well- 
known basic methods for doing this (efficiently). The first one is constructing a plan by 
combining partial plans in a way that leads to an optimised execution order. Very often 
putting the partial plans together requires using a heuristic to solve a combinatorial 
problem. The second method is to quickly create a complete plan (e.g., by using a 
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simple heuristic) and then try to improve the plan by rewriting it (e.g., by swapping or 
removing and re-inserting operators). In our approach we wanted to have both options 
available, so we decided to develop different variants. The complexity of our heuristic 
shows slight differences depending on the variant we use; however, the algorithms we 
apply all have polynomial complexity. 

Our max-min heuristic algorithm, H {p, q), which is in fact a template for a number 
of algorithms, is shown as Algorithm 1. It is parameterised by two inputs: p, a (possibly 
empty) starting plan, and q, an order in which to process operators. Clearly, to generate 
a complete plan the union of p and q has to contain all the operators. If the intersection 
of p and q is empty, our algorithm is similar to insertion sort: in turn, we consider 
each operator in q and place it into p at the position that minimises the regret over all 
max-min scenarios. If an operator in q is already present in p, then we remove it from 
p before re-inserting it. This is equivalent to moving an operator to a different position. 
Again we determine the position minimising the regret over all max-min scenarios. 


Algorithm 1: H{p, q) 

1 foreach operator tfrom the sequence q do 

2 if f is in p then remove t from p; 

3 Assume p currently comprises i operators; 

4 foreach position j, I < j < i + I, in p do 

5 Temporarily insert t in position j in p; 

6 foreach max-min scenario for p do 

7 Calculate the regret of plan p; 

8 Store the maximum regret for position j ; 

9 Choose as the final position for f in p that which minimises the maximum 
regret; 

10 Return p; 


It is clear that the max-min heuristic runs in polynomial-time. For each partial plan 
comprising i operators, we consider i -\- \ possible positions for the next operator. In 
each of these positions, we consider i-\-2 max-min scenarios. Calculating the regret of 
a plan with n operators can be done in time 0{n log n). Hence the algorithm described 
above has an overall complexity of O(n^logn) (in the worst case i = n for every 
execution of the outer loop). However, by computing costs incrementally when an op¬ 
erator moves position and one max-min scenario moves to the next, we can implement 
the heuristic to run in time 

Example 6. Recall from Example 5 the set S = {<Ji, a 2 , cr^} of selection operators, 
with selectivities Si = [.2, .8], S2 = [-3, .5] and = [.1,.4]. Consider our max- 
min heuristic algorithm, H(p,q), with initial plan p = CTsCTi and remaining operator 
q = (72. Since p consists of two operators, should be checked in three positions: 
before a^, after ai and between them. For each position and resulting plan, the regret 
is calculated under all max-min scenarios, of which there are four in this example. 

As an example, consider the plan in which is placed between and ai. The 
regret will be calculated for the scenarios (s 3 ,S 2 j®i)’ (s 3 )S 2 jSi)> (s 3 )S 2 ,Si) and 
(s3, S2 , Si). The maximum regret for this plan is 0.3 which occurs in scenario (sa, S2, Si) 

Finally, the solution will be the plan with the smallest maximum regret, which hap¬ 
pens to be CT 3 (T 2 (Ji. As a matter of fact, the solution returned by the max-min heuristic 
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is the same as the actual minmax regret solution, as was shown in Example 4. 0 

In the following two subsections, we consider various criteria for choosing an initial 
plan and for ordering the remaining operators. 

7.1 Choosing an Initial Plan 

Even though we can run our heuristic with an empty initial plan p, i.e., building a 
solution by inserting all operators one by one, often it makes sense to start with a 
prebuilt partial plan. 

One particular and important case is that of dominant operators. Given a set S 
of operators, if we can identify a subset 5" C S' of dominant operators, we know 
that we can find an optimal solution p' for S' quickly and that the relative order of 
the operators in p' will not change in any optimal plan for S (see Theorem 2). Thus, 
taking p' as the initial plan when calling H{p, q) makes good sense. However, there 
may be different ways to choose S', as in general there may be more than one such 
dominant set. If we have more than one option, we can use the following criteria to 
make a decision: choose the subset S' (1) with the maximum cardinality or (2) whose 
operators have the largest total width. As we often encountered several subsets sharing 
the same maximum cardinality, we introduced a tie-breaker: choose the subset S' (3) 
with the maximum cardinality whose total width is greatest. In our experiments, we 
found that this third approach gave the best overall results. 

Example 7. Recall from Example 6 the set S = {ci, (T2, CT3} of selection operators, 
with selectivities si = [.2, . 8 ], S 2 = [.3, .5] and S 3 = [.1, .4]. Set S has two dominant 
subsets: Si = {cti, cr3} and S 2 = {o' 2 , 173}. Both obviously satisfy criterion (1) above, 
being of maximum cardinality. However, if we use criterion (2), namely the set which 
has operators with the largest total selectivity width, then we will choose Si since 
its total width is 0.9 while that of S 2 is 0.5. Si would also be chosen according to 
criterion (3). 

After choosing the preferable subset, we need to produce initial plan p by sorting 
the operators in nondecreasing order of their minimum (or maximum) selectivities. 
Therefore, p = a^ai when Si is chosen, while p = a^a 2 if Si is chosen. 0 

Having an initial plan allows us to combine our algorithm with other heuristics. 
We can take the output of another algorithm as our initial plan p and then refine this 
result by running H(p, q) on it. Moreover, we can use the output of H(p, q) as input 
for another iteration of our own heuristic. 

7.2 Ordering Criteria 

Since our algorithm makes only a single pass over all the operators when (re-)inserting 
them into the plan, the order in which operators are considered may have a significant 
impact on the final outcome. For example, when inserting selections into an empty 
initial plan, operators considered earlier are tested in fewer positions relative to each 
other compared to those considered later. 

We have considered two different ordering criteria in our experiments: interval mid¬ 
point (denoted by M) and interval width (denoted by W). Given a selectivity interval 
s = [s, s], the midpoint of s is (s + s)/2 while the width of s is s — s. In each case, 
operators can be ordered by non-decreasing (denoted -f) or non-increasing (denoted —) 
values. Overall, the ordering criteria are denoted by M-f, M— , W-\- and W—. So, for 
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example, W+ stands for operators being considered in non-decreasing order of their 
selectivity interval width. 


8 Experimental Results 

We evaluated the max-min heuristic experimentally, measuring the impact of different 
parameters on its performance. We also implemented the brute-force algorithm for 
finding optimal solutions in order to evaluate how well the heuristic performs. 

A commodity PC, with 8 GB RAM, Intel Core i5 processor running at 3.19 GHz 
and Windows 7 Enterprise (64-bit), was used to perform the experiments. The min- 
max regret brute-force algorithm and max-min heuristic were implemented in Java and 
compiled with the Eclipse IDE (Juno release), which is JDK compliant and uses the 
JavaSE-1.7 execution environment. The SSB queries were run on a simulation plat¬ 
form written in Ruby 1.9.3. 

8.1 Generating Test Data 

We first generated a synthetic data set to investigate the performance of our heuristic. 
Each test case corresponded to a set of k selection operators, with k ranging from 2 
to 10, and for each k we generated a hundred different sets. While fc = 2 is not hard 
to solve, it was included for verification purposes (any heuristic has to be able to find 
the optimal plan for this simple case). Ten operators was the upper limit we were able 
to solve optimally, checking 10! • (« 3.7 billion) different costs for each test case. 

Eor each set of selection operators we determined the lower and upper bounds of their 
selectivity intervals by generating 2k uniformly distributed random numbers between 
0 and 1. 

Eor real-world data, we used the Enron email data set, as introduced in Example 1. 
Once again, test queries used from 2 to 10 operators/predicates. Eor each n G [2,10], 
20 queries were generated, each with one predicate on subject and n — 1 predicates 
on body. The 20 queries were generated by randomly selecting from 40 keywords for 
subject and 45 keywords for body, and were checked to ensure that each returned 
a nonempty answer. 

We also evaluated minmax regret optimisation using a version of the Star Schema 
Benchmark (SSB) with data skew [28] (SSB itself is a variation of the TPC-H bench¬ 
mark). We generated benchmark data with a scaling factor of 1, meaning that the 
central facts table, line order, contains 6,000,197 tuples, and joined all dimensional ta¬ 
bles to the lineorder table. We then randomly picked from two to ten attributes from 
a subset of all available attributes to generate queries. Queries basically consist of a 
conjunctive predicate whose clauses are made up of the selected attributes compared 
to a random value taken from the attribute’s domain, using a less-than or greater-than 
operator. The following predicate is an example generated in our experiments: 
orderKey < 2964443 and linenumber > 5 and quantity < 29. 

8.2 Parameters 

Eor the synthetic and Enron data sets, we looked at the effects of the ordering criteria 
and the choice of initial plan on the quality of our heuristic. Additionally, we investi¬ 
gated the impact of running our heuristic multiple times, using the output of one phase 
as the initial plan of the next phase. 
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Figure 3: Histogram for attribute ordtotalprice. 


We measure the performance of our heuristic by defining the regret ratio \{S), 
which is the regret computed by H{p, q) divided by the optimal regret. More formally, 
given a set S of selection operators, let us denote the set of possible plans by P{S) and 
the set of possible scenarios by X{S). Recall from Section 3.1 that R(P{S), X{S)) 
then denotes the optimal regret. Then 

R{H{p,q),X{S)) 

R{P{S),X{S)) 

We only calculate X{S) using the above formula when the optimal minmax regret is 
non-zero. As mentioned in Section 6, the optimal minmax regret is zero only when S 
forms a strictly dominating set. For such cases, our max-min heuristic always finds the 
optimal minmax regret solution, so we define X{S) to be one. 

In view of having multiple test cases per number of selection operators, we calcu¬ 
late the average regret ratio and the worst regret ratio (simply the maximum value of 
X{S)). 

For the Enron data, we calculated selectivity intervals for the like predicates as 
described in the Introduction. For each selected keyword, we ran queries to find the 
minimum selectivity (given by exact matches of the keyword) and the maximum se¬ 
lectivity (given by the minimum selectivity of all 2-grams of the keyword). This gave 
rise to a range of intervals: those with small values such as [0.0004, 0.01] for keyword 
‘progress’ in the subject, those with larger values such as [0.6, 0.7] for ‘you’ in the 
body, and those with a big range such as [0.07, 0.6] for ‘price’ in the body. 

For the Star Schema Benchmark we created some very rudimentary histograms 
by dividing the domain of an attribute into equal-sized ranges, counting the number 
of tuples that fall into each range. We do not keep any further information on the 
distribution of tuples within each range of a histogram. For example. Figure 3 shows 
the histogram for the attribute ordtotalprice, consisting of 20 ranges each covering 
roughly 18,000 different values, e.g., bucket #1 covers the range from 1 to 17,673. 

This basic information allows us to determine intervals for the selectivities of selec¬ 
tion operators. For a “less than” / “greater than” operator, we know that all histogram 
ranges exclusively covering smaller/larger values have to be included fully. However, 
for the range the predicate value falls into, we do not know precisely how many ele- 
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ments will be selected. In extreme cases, none or all of the elements satisfy the predi¬ 
cate, giving us the lower and upper bound for the selectivity. Example 8 illustrates this 
with concrete values. 


Example 8. Given the histogram for attribute X below and the predicate X < 126, 


we can compute the lower bound and upper bound for the selectivity as follows: lower 
bound = = 0.2, upper bound = = 0.3. 


Range 

# of elements 

1-100 

200 

101-200 

100 

201-300 

400 

301-400 

300 


0 

Many sophisticated query optimisation techniques, such as least expected cost (EEC), 
assume that they have access to probability distributions of parameter values. EEC 
needs this to be able to compute utilities [7]. However, in our case we only have very 
rudimentary statistics, since we do not know anything about the distribution of attribute 
values within a range. The best we can do is to fall back on the assumption of uniform 
distribution, approximating the distribution using a mean value (this is also what least 
specific cost (ESC) optimisation would do in this case). Eor example, applying this 
method to the numbers given in Example 8 would yield a selectivity of 0.225 for the 
predicate X < 126. We compare our minmax regret optimisation technique to a mean- 
value-based approach using SSB data. Additionally, we do a comparison with a simple 
midpoint heuristic, i.e., sorting the intervals in non-decreasing order of their midpoint. 

8.3 Results 

Eirst we present the results obtained studying the different variants of the max-min 
heuristic on the synthetic and Enron data, and then move on to the Star Schema Bench¬ 
mark results. 

8.3.1 Synthetic and Enron Data Sets 

We experimented with a number of operator ordering criteria and initial plans for the 
max-min heuristic. These included starting with an empty initial plan (0), consider¬ 
ing random operator ordering (U), ordering by midpoint (M- and M-t) and ordering by 
width (W- and W-t). We briefly summarise the findings of our experiments here. Over¬ 
all, the Wh- ordering (non-decreasing width) performed best with an overall average 
regret ratio of 1.03 and an overall worst regret ratio of 1.94. W- was often even worse 
than a random order, while Mh- and M- sometimes generated plans whose regret ratio 
was above 3. We also ran a midpoint heuristic that simply ordered the intervals in non¬ 
decreasing order of their midpoints (not going through all max-min scenarios). The 
midpoint heuristic was often worse than running the max-min heuristic with a random 
order. 

While Wh- ordering performs better than the Mh-, M-, and W- max-min heuristics 
and the midpoint heuristic, it is still not significantly better than the random ordering. 
In a second phase of our evaluation we seeded our heuristic with an initial plan. The 
results for initial plan D;CW with operator ordering Wh- were best (D:CW stands for the 
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Figure 4; Results for synthetic data set. 

largest subset of dominant operators, and, in case of a tie, the one with the greatest total 
width of the operators) in terms of the percentage of exact solutions and the average 
regret ratio. The results for the worst case regret ratio were rather inconclusive, so we 
tried to improve on this by running multiple phases of our heuristic. 

Figures 4(a), (b) and (c) show the results for running our heuristic multiple times. 
This means that we take the output of tTinning one phase of our heuristic and use it 
as the initial plan for the next phase. The figures show the results for starting off by 
running (D:CW, W+) first and then executing two more phases. 

As can be seen, this variant clearly outperforms the baseline algorithm (0,U), the 
midpoint heuristic, and the other variants in all respects. For example, for 10 operators, 
the worst regret ratio is less than 1.23 and the average ratio is approximately 1.01, 
compared to approximately 1.94 and 1.08, respectively, for running only a single phase 
of the heuristic. Moreover, running one additional phase improves the quality of the 
generated plan significantly, but running another phase makes almost no difference. 

Figure 4(d) shows the run time of the W+ ordering variant (single and multiple 
phases) together with the baseline algorithm (0,U) when generating plans for up to 
200 operators. Unsurprisingly, the variants midpoint, (0,U), and (D;CW,W+) have 
the fastest run times, as they only sort a set of operators or execute a single opera¬ 
tor insertion phase. Furthermore, it can be clearly seen that the additional run time 
of (((D:CW,Wh-),Wh-),Wh-) does not pay off, since it produces plans that are only 
marginally better than those of ((D:CW,Wh-),Wh-). 
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Figure 5; Results for Enron data set. 

The results on the Enron data set (Eigure 5) showed similar trends^, but were more 
impressive in every respect. The two- and three-phase variants of the max-min heuristic 
found the minmax optimal solution in 84% of cases, had a worst regret ratio of only 
1.05, and an average regret ratio of less than 1.001. By contrast, the midpoint heuristic 
had a worst regret ratio of over 1.49, an average of 1.06, and did not find a single 
minmax optimal solution with 10 operators 

To highlight how bad a poor choice of selectivity can be, we also tested using the 
minimum selectivity values of the intervals (as would be done if estimates were based 
simply on the selectivity of the keywords themselves). This produced a worst case 
regret ratio of almost 30 for only 5 operators. 


8.3.2 Star Schema Benchmark 

We optimised the generated SSB queries using minmax regret optimisation, a mean- 
value-based approach, and also computed the optimal execution plan using exact selec- 
tivities, which means that we are comparing actual query plan costs rather than regret 
ratios. 

Figure 6(a) shows the results for the average difference in costs between the query 
execution plans generated by different methods and the optimal plan (every data point 
in the diagram averages the measurement obtained by running 100 different queries). 
We only include two variants of minmax regret optimisation, (D;CW,Wh-) and the sim- 

^The ran time was exactly the same, which is why we are omitting the diagram here. 
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Figure 6: Results on difference in cost to optimal. 

pie midpoint heuristic, as for SSB no major differences were discernible between the 
different variants in terms of the quality of the query plans. Surprisingly, the midpoint 
heuristic, although not very good at optimising the regret ratio, seems to produce ef¬ 
ficient query execution plans. Considering the fact that all queries had an average run 
time between 60 and 80 seconds, the numbers shown in Figure 6(a) may not seem like 
a big difference. However, this shows that minmax regret optimisation delivers better 
plans than a mean-value-based approach. 

More important is the robustness of the approaches, i.e., how good are they in 
avoiding bad plans? Figure 6(b) shows the standard deviation of the cost difference to 
the optimal plan, illustrating that the mean-value-based approach is more erratic than 
minmax regret optimisation. The most extreme case for all SSB queries was a mean- 
value-optimised plan more than doubling the run time of the optimal plan (from 60s to 
135s), while for minmax regret optimisation the very worst plan added roughly 50% 
more to the cost of the optimal plan (from 60s to 92s). 


9 Conclusion 

We have investigated query optimisation under partial ignorance, in particular order¬ 
ing selection operators optimally if their selectivities are defined by an interval rather 
than an exact value. The strategy we employed, minmax regret optimisation (MRO), 
is considered to be a pessimistic approach compared to other techniques from decision 
theory. In our opinion this makes it well-suited to query optimisation in database sys¬ 
tems, which should be about avoiding bad plans rather than finding the best one. There 
is one major drawback, though: selection ordering becomes NP-hard when applying 
MRO to it. However, we have shown that special cases can be solved efficiently and 
that heuristics can quickly find good solutions. 

For future work we plan to extend our approach to costs described by intervals 
and relative regret, i.e., considering the ratio of the cost of a plan to the optimal plan 
for a scenario rather than the difference. Also interesting are other operators, such as 
joins, whose ordering is heavily influenced by selectivities as well and suffers from 
similar issues: it is hard to obtain exact values. Further topics we would like to tackle 
are finding approximation algorithms with proven bounds and modelling correlation of 
query predicates. Nevertheless, we think this is an important first step in discovering 
new approaches for making query optimisers more robust and one of our medium term 
goals is to build a general framework for query optimisation under partial ignorance. 
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